{
 "cells": [
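  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Pandas with a Python scraper to read HTML tables from web pages and save them to an Excel file\n",
    "\n",
    "Goals:\n",
    "* NetEase Youdao Dictionary (网易有道词典) looks up English words and lets you add the words you look up to a wordbook;\n",
    "* It currently has no feature for exporting the full word list. To make review easier, we scrape the entire word list and save it to Excel.\n",
    "\n",
    "Techniques involved:\n",
    "* Pandas: a powerful Python library for data processing and data analysis\n",
    "* Python scraping: download the web pages with the requests library and then parse them; the login check has to be bypassed by reusing the cookies of a logged-in session\n"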
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import requests.cookies\n",
    "import json\n",
    "import time\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 0. Workflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Input page: Youdao Dictionary wordbook</h4>\n",
    "<img src=\"./course_datas/c32_read_html/youdao_cidian.png\" style=\"width:50%; margin-left:0px;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Workflow</h4>\n",
    "<img src=\"./course_datas/c32_read_html/ppt_flow.png\" style=\"width:70%; margin-left:0px;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4>Results saved to an Excel file (easy to print for review):</h4>\n",
    "<img src=\"./course_datas/c32_read_html/output_excel.png\" style=\"width:70%; margin-left:0px;\"/>"
   ]
  },
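  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Log in to the PC version of NetEase Youdao Dictionary by scanning the WeChat QR code, then copy the cookies to a file\n",
    "\n",
    "* PC version: http://dict.youdao.com/  \n",
    "* A Chrome extension that can copy cookies as JSON: http://www.editthiscookie.com/"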
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build a cookie jar from the cookies exported (as JSON) from the logged-in browser session\n",
    "cookie_jar = requests.cookies.RequestsCookieJar()\n",
    "\n",
    "with open(\"./course_datas/c32_read_html/cookie.txt\", encoding=\"utf-8\") as fin:\n",
    "    cookiejson = json.load(fin)\n",
    "\n",
    "# Each exported entry is a dict; copy only the fields requests needs\n",
    "for cookie in cookiejson:\n",
    "    cookie_jar.set(\n",
    "        name=cookie[\"name\"],\n",
    "        value=cookie[\"value\"],\n",
    "        domain=cookie[\"domain\"],\n",
    "        path=cookie[\"path\"]\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<RequestsCookieJar[Cookie(version=0, name='DICT_LOGIN', value='3||1578922508302', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='DICT_PERS', value='v2|weixin||DICT||web||2592000000||1578922508299||114.244.161.198||wxoXQUDj_FtHSw23tfJWsboPkq38ok||gFnMeLRLQLRpBOMYMhf6LRUf0Mz5P4TLRqSOM6uhfY5RzW0L6ZhHTB0kGRHeukLg40QZOMOMkMwu0gBkfJF0LTL0', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='DICT_SESS', value='v2|odmTRIUgTmgz6MlEOMqB0TBnfk5h4pZ0Py0MeBP4Q40qynHeuPMOWRpLPMY5RHJuRQykfJBOLQBRPKO4YYOLquR6zhLwBnMYMR', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='DICT_UGC', value='be3af0da19b5c5e6aa4e17bd8d90b28a|', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='JSESSIONID', value='abc46uQPL03Au_P0ghF_w', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='OUTFOX_SEARCH_USER_ID', value='\"1678365514@10.108.160.18\"', port=None, port_specified=False, 
domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='OUTFOX_SEARCH_USER_ID_NCOO', value='1349541628.6994112', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='ACCSESSIONID', value='8F00E30693F3BD052C9A4F293394BE0A', port=None, port_specified=False, domain='dict.youdao.com', domain_specified=True, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='___rl__test__cookies', value='1578922438675', port=None, port_specified=False, domain='dict.youdao.com', domain_specified=True, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cookie_jar"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Download the HTML of every page into a list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
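     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "**Scraping page 0\n",
       "**Scraping page 1\n",
       "**Scraping page 2\n",
       "**Scraping page 3\n",
       "**Scraping page 4\n",
       "**Scraping page 5\n"
      ]
     }
    ],
    "source": [
     "htmls = []\n",
     "# Wordbook list URL; {idx} is the zero-based page number\n",
     "url = \"http://dict.youdao.com/wordbook/wordlist?p={idx}&tags=\"\n",
     "# This wordbook spans 6 pages (0-5)\n",
     "for idx in range(6):\n",
     "    time.sleep(1)  # pause between requests to go easy on the server\n",
     "    print(\"**Scraping page %d\" % idx)\n",
     "    # Send the logged-in session's cookies so the wordbook page is returned\n",
     "    r = requests.get(url.format(idx=idx), cookies=cookie_jar, timeout=10)\n",
     "    htmls.append(r.text)"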
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'<!doctype html>\\n<html>\\n<head>\\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\"/>\\n<title>有道单词本</title>\\n\\n<link rel=\"canonical\" href=\"http://dict.youdao.com/wordbook/\"/> \\n<meta name=\"Keywords\" content=\"单词本,web单词本,有道,词典,youdao\" />\\n<meta name=\"Description\" content=\"有道词典单词本\" />\\n<link rel=\"shortcut icon\" href=\"http://shared.ydstatic.com/images/favicon.ico?213\" type=\"image/x-icon\"/>\\n<link href=\"http://shared.ydstatic.com/r/1.0/s/g3.css?20110428\" rel=\"stylesheet\" type=\"text/css\"/>\\n<link type=\"text/css\" href=\"resources/styles/main.css\" rel=\"stylesheet\">\\n\\n<style type=\"text/css\">\\n\\n#f{background-image:url(http://shared.ydstatic.com/images/skins/default/skin-x.jpg)}\\n#fbl{background:url(http://shared.ydstatic.com/images/skins/default/skin_.jpg) left top}\\n#fbr{background:url(http://shared.ydstatic.com/images/skins/default/skin_.jpg) right -200px}\\n\\n</style>\\n<script type=\"text/javascript\">\\nvar VARIABLES={ \\n                tags:\"\",\\n                page:\"0\",\\n                sort:\"\",\\n                querystring:\"\"\\n        };\\n</script>\\n\\n\\n</head>\\n\\n<body>\\n\\n<div id=\"t\">\\n    <div id=\"u\">\\n                    <span id=\"un\">\\n        <span class=\"un_n\">晚上好，</span>\\n        <span id=\"mun\" class=\"un_box\"><b class=\"un_l\"><q></q></b><b class=\"un_r\"><q></q></b>\\n                 <span class=\"un_btn\"><b class=\"un_m\">&nbsp;<q></q></b>\\n               <span class=\"un_ml\">\\n                    wxoXQUDj_FtHSw23tfJWsboPkq38ok\\n                                  </span>\\n                                    </span>\\n            </span>\\n       </span>\\n                       <span class=\"sl\">|</span>\\n                            <a href=\"http://account.youdao.com/logout?service=dict&back_url=http%3A%2F%2Fdict.youdao.com%2Fwordbook%2Fwordlist\">登出</a>\\n            </div>\\n    <div id=\"n\">\\n        <a 
href=\"http://www.163.com/\" id=\"mn\" class=\"mn\" target=\"_blank\"><u>网易</u><s>▼</s></a>\\n        <span class=\"sl\">|</span>\\n        <a class=\"search-js\" data-product=\\'www\\' href=\"http://www.youdao.com\">网页</a>\\n        <a class=\"search-js\" data-product=\\'image\\' href=\"http://image.youdao.com\">图片</a>\\n        <a class=\"search-js\" data-product=\\'news\\' href=\"http://news.youdao.com\">热闻</a>\\n        <a class=\"search-js\" data-product=\\'gouwu\\' href=\"http://gouwu.youdao.com\">购物</a>\\n        <a class=\"search-js\" data-product=\\'dict\\' href=\"http://dict.youdao.com\">词典</a>\\n        <a class=\"search-js\" data-product=\\'fanyi\\' data-trans=\\'translate?i=\\' href=\"http://fanyi.youdao.com/\">翻译</a>\\n        <a class=\"search-js\" data-product=\\'note\\' href=\"http://note.youdao.com\">笔记</a>\\n        <strong>单词本</strong>\\n\\t<a class=\"mn\" target=\"_blank\" href=\"http://www.youdao.com/about/productlist.html\"><u>更多»</u></a>\\n    </div>\\n    </div>\\n\\n\\n<div id=\"ym\" class=\"pm\">\\n    <ul>\\n        <li><a href=\"http://video.youdao.com\" class=\"search-js\" data-product=\\'video\\'>视频</a></li>\\n        <li><a href=\"http://blog.youdao.com/\" class=\"search-js\" data-product=\\'blog\\'>博客</a></li>\\n        <li><a href=\"http://tie.youdao.com/\" class=\"search-js\" data-product=\\'tie\\'>快贴</a></li>\\n        <li><a href=\"http://ditu.youdao.com/\" class=\"search-js\" data-product=\\'ditu\\'>地图</a></li>\\n\\n        <li class=\"sl\"></li>\\n        <li><a href=\"http://reader.youdao.com\">阅读</a></li>\\n        <li><a href=\"http://m.youdao.com/help\">手机</a></li>\\n        <li><a href=\"http://shuqian.youdao.com\">书签</a></li>\\n        <li><a href=\"http://cidian.youdao.com\" class=\"search-js\" data-product=\\'cidian\\'>桌面词典</a></li>\\n        <li class=\"sl\"></li>\\n        <li><a href=\"http://www.youdao.com/about/productlist.html\">全部产品</a></li>\\n\\n    </ul>\\n</div>\\n<div id=\"nm\" class=\"pm\">\\n    <ul>\\n    
    <li><a href=\"http://www.163.com/\" target=\"_blank\">首页</a></li>\\n        <li><a href=\"http://news.163.com/\" target=\"_blank\">新闻</a></li>\\n        <li><a href=\"http://email.163.com/\" target=\"_blank\">邮箱</a></li>\\n        <li><a href=\"http://blog.163.com/\" target=\"_blank\">博客</a></li>\\n\\n        <li><a href=\"http://photo.163.com/\" target=\"_blank\">相册</a></li>\\n        <li><a href=\"http://nie.163.com/\" target=\"_blank\">游戏</a></li>\\n        <li class=\"sl\"></li>\\n        <li><a href=\"http://sitemap.163.com/\" target=\"_blank\">全部产品</a></li>\\n    </ul>\\n</div>\\n\\n\\n<!-- 图标与搜索框 -->\\n<form id=\"f\" method=\"get\" action=\"#\" name=\"sb\">\\n  <h1 id=\"yd\"><a href=\"/wordbook/wordlist\">有道单词本</a></h1>\\n   <!--<div id=\"ts\" class=\"fc\">\\n \\n    <div class=\"qc no-suggest\" id=\"qc\">\\n      <input name=\"tab\" value=\"chn\" type=\"hidden\">\\n      <input name=\"keyfrom\" value=\"shuqian.top\" type=\"hidden\">\\n      <input type=\"text\" class=\"q\" name=\"q\" id=\"query\" autocomplete=\"off\"  value=\"\"/>\\n    </div>\\n    <input type=\"submit\" value=\"搜 索\" class=\"qb\" name=\"btnSearchTag\"/>\\n    \\n  </div>-->\\n  <div class=\"ao\"></div>\\n  <div id=\"fbl\"> </div>\\n  <div id=\"fbr\"> </div>\\n</form> \\n \\n\\n<div id=\"wrapper\">\\n\\n\\n    <div id=\"top\" >\\n        \\n\\n                <a href=\"#\" id=\"addword\"></a>\\n\\n                \\n              \\n            <div style=\"width:500px;float:right;text-align:right;\">    \\n                <label for=\"select_category\">分类</label>\\n                <select id=\"select_category\">\\n                    <option value=\"\">全部分类</option>\\n                                            <option value=\"无标签\" >无标签 </option>\\n                                    </select>  \\n                        \\n        <a href=\"#\" id=\"toggle_listmode\" class=\"active\"></a><a href=\"#\" id=\"toggle_cardmode\" ></a>\\n        </div>\\n        <div 
class=\"clear\"></div>\\n\\n    </div>   \\n    \\n    <div id=\"listmode\">\\n               <div id=\"wordhead\">\\n            <table  width=\"100%\" style=\"table-layout:fixed;background:#fff;\">\\n                    <tr>\\n                        <th width=\"50px\">序号</th>\\n                        <th width=\"80px\">单词</th>\\n                        <th width=\"80px\">音标</th>\\n                        <th width=\"320px\">解释</th>\\n                       <!--  <th width=\"50px\">难度</th> -->\\n                        <th width=\"85px\">时间</th>\\n                        <th>分类</th>\\n                        <th width=\"65px\">操作</th>\\n                    </tr>\\n            </table>\\n        </div> \\n        \\n        <div id=\"wordlist\" >\\n            <table  width=\"100%\" style=\"table-layout:fixed\">\\n\\n                <tbody>\\n                                        <tr>\\n                        <td width=\"50px\"> 1</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"agglomerative\"><a href=\"/search?keyfrom=webwordbook&q=agglomerative\"  target=\"_blank\"><strong>agglomerative</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"\"></div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"adj. 会凝聚的；[冶] 烧结的，凝结的\">adj. 
会凝聚的；[冶] 烧结的，凝结的</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2020-1-13</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑agglomerative\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=agglomerative&p=0\" \\n                                                        class=\"deleteword\" title=\"删除agglomerative\" onclick=\\'if(!confirm(\"您确定删除单词 agglomerative 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 2</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"anatomy\"><a href=\"/search?keyfrom=webwordbook&q=anatomy\"  target=\"_blank\"><strong>anatomy</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[ə&#39;nætəmɪ]\">[ə&#39;nætəmɪ]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. 解剖；解剖学；剖析；骨骼\">n. 
解剖；解剖学；剖析；骨骼</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2017-7-17</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑anatomy\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=anatomy&p=0\" \\n                                                        class=\"deleteword\" title=\"删除anatomy\" onclick=\\'if(!confirm(\"您确定删除单词 anatomy 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 3</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"backbone\"><a href=\"/search?keyfrom=webwordbook&q=backbone\"  target=\"_blank\"><strong>backbone</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[&#39;bækbəʊn]\">[&#39;bækbəʊn]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. 支柱;主干网;决心,毅力;脊椎\">n. 
支柱;主干网;决心,毅力;脊椎</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2017-7-13</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑backbone\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=backbone&p=0\" \\n                                                        class=\"deleteword\" title=\"删除backbone\" onclick=\\'if(!confirm(\"您确定删除单词 backbone 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 4</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"ballpark\"><a href=\"/search?keyfrom=webwordbook&q=ballpark\"  target=\"_blank\"><strong>ballpark</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[&#39;bɔːlpɑːk]\">[&#39;bɔːlpɑːk]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. (美)棒球场;活动领域;可变通范围\\nadj. 大约的\">n. (美)棒球场;活动领域;可变通范围\\nadj. 
大约的</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2019-10-16</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑ballpark\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=ballpark&p=0\" \\n                                                        class=\"deleteword\" title=\"删除ballpark\" onclick=\\'if(!confirm(\"您确定删除单词 ballpark 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 5</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"bilingual\"><a href=\"/search?keyfrom=webwordbook&q=bilingual\"  target=\"_blank\"><strong>bilingual</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[baɪ&#39;lɪŋgw(ə)l]\">[baɪ&#39;lɪŋgw(ə)l]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"adj. 双语的\\nn. 通两种语言的人\">adj. 双语的\\nn. 
通两种语言的人</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2019-10-15</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑bilingual\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=bilingual&p=0\" \\n                                                        class=\"deleteword\" title=\"删除bilingual\" onclick=\\'if(!confirm(\"您确定删除单词 bilingual 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 6</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"canonical\"><a href=\"/search?keyfrom=webwordbook&q=canonical\"  target=\"_blank\"><strong>canonical</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[kə&#39;nɒnɪk(ə)l]\">[kə&#39;nɒnɪk(ə)l]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"adj. 依教规的;权威的;牧师的\\nn. 牧师礼服\">adj. 依教规的;权威的;牧师的\\nn. 
牧师礼服</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2019-10-14</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑canonical\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=canonical&p=0\" \\n                                                        class=\"deleteword\" title=\"删除canonical\" onclick=\\'if(!confirm(\"您确定删除单词 canonical 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 7</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"cater\"><a href=\"/search?keyfrom=webwordbook&q=cater\"  target=\"_blank\"><strong>cater</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[&#39;keɪtə]\">[&#39;keɪtə]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"vt. 投合，迎合；满足需要；提供饮食及服务\\nn. (Cater)人名；(英)凯特\">vt. 投合，迎合；满足需要；提供饮食及服务\\nn. 
(Cater)人名；(英)凯特</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2017-7-17</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑cater\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=cater&p=0\" \\n                                                        class=\"deleteword\" title=\"删除cater\" onclick=\\'if(!confirm(\"您确定删除单词 cater 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 8</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"clarity\"><a href=\"/search?keyfrom=webwordbook&q=clarity\"  target=\"_blank\"><strong>clarity</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[&#39;klærɪtɪ]\">[&#39;klærɪtɪ]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. 清楚,明晰;透明\\nn. (Clarity)人名;(英)克拉里蒂\">n. 清楚,明晰;透明\\nn. 
(Clarity)人名;(英)克拉里蒂</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2019-10-16</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑clarity\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=clarity&p=0\" \\n                                                        class=\"deleteword\" title=\"删除clarity\" onclick=\\'if(!confirm(\"您确定删除单词 clarity 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 9</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"compression\"><a href=\"/search?keyfrom=webwordbook&q=compression\"  target=\"_blank\"><strong>compression</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[kəm&#39;preʃ(ə)n]\">[kəm&#39;preʃ(ə)n]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. 压缩,浓缩;压榨,压迫\">n. 
压缩,浓缩;压榨,压迫</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2019-10-15</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑compression\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=compression&p=0\" \\n                                                        class=\"deleteword\" title=\"删除compression\" onclick=\\'if(!confirm(\"您确定删除单词 compression 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 10</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"contaminated\"><a href=\"/search?keyfrom=webwordbook&q=contaminated\"  target=\"_blank\"><strong>contaminated</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"\"></div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"adj. 受污染的，弄脏的 v. 污染；玷污，毒害（contaminate 的过去式和过去分词）\">adj. 受污染的，弄脏的 v. 
污染；玷污，毒害（contaminate 的过去式和过去分词）</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2020-1-13</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑contaminated\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=contaminated&p=0\" \\n                                                        class=\"deleteword\" title=\"删除contaminated\" onclick=\\'if(!confirm(\"您确定删除单词 contaminated 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 11</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"counterparts\"><a href=\"/search?keyfrom=webwordbook&q=counterparts\"  target=\"_blank\"><strong>counterparts</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[]\">[]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. （契约）副本（counterpart的复数）；相对物；相对应的人\">n. 
（契约）副本（counterpart的复数）；相对物；相对应的人</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2017-7-16</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑counterparts\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=counterparts&p=0\" \\n                                                        class=\"deleteword\" title=\"删除counterparts\" onclick=\\'if(!confirm(\"您确定删除单词 counterparts 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 12</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"criteria\"><a href=\"/search?keyfrom=webwordbook&q=criteria\"  target=\"_blank\"><strong>criteria</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[kraɪ&#39;tɪərɪə]\">[kraɪ&#39;tɪərɪə]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. 标准，条件（criterion的复数）\">n. 
标准，条件（criterion的复数）</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2017-7-6</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑criteria\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=criteria&p=0\" \\n                                                        class=\"deleteword\" title=\"删除criteria\" onclick=\\'if(!confirm(\"您确定删除单词 criteria 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 13</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"crunch\"><a href=\"/search?keyfrom=webwordbook&q=crunch\"  target=\"_blank\"><strong>crunch</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[krʌntʃ]\">[krʌntʃ]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n.咬碎，咬碎声；扎扎地踏\\nvt.压碎；嘎扎嘎扎的咬嚼；扎扎地踏过\\nvi.嘎吱作响地咀嚼；嘎吱嘎吱地踏过\">n.咬碎，咬碎声；扎扎地踏\\nvt.压碎；嘎扎嘎扎的咬嚼；扎扎地踏过\\nvi.嘎吱作响地咀嚼；嘎吱嘎吱地踏过</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span 
class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2019-10-8</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑crunch\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=crunch&p=0\" \\n                                                        class=\"deleteword\" title=\"删除crunch\" onclick=\\'if(!confirm(\"您确定删除单词 crunch 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 14</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"delighted\"><a href=\"/search?keyfrom=webwordbook&q=delighted\"  target=\"_blank\"><strong>delighted</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"[dɪ&#39;laɪtɪd]\">[dɪ&#39;laɪtɪd]</div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"adj. 高兴的;欣喜的\\nv. 使…兴高采烈;感到快乐(delight的过去分词)\">adj. 高兴的;欣喜的\\nv. 
使…兴高采烈;感到快乐(delight的过去分词)</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2019-10-16</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑delighted\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=delighted&p=0\" \\n                                                        class=\"deleteword\" title=\"删除delighted\" onclick=\\'if(!confirm(\"您确定删除单词 delighted 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                        <tr>\\n                        <td width=\"50px\"> 15</td>\\n                        <td width=\"80px\"><div class=\"word\"  title=\"denominator\"><a href=\"/search?keyfrom=webwordbook&q=denominator\"  target=\"_blank\"><strong>denominator</strong></a></div></td>\\n                        <td width=\"80px\"><div class=\"phonetic\"  title=\"\"></div></td>\\n                        <td width=\"320px\">\\n                            <div  class=\"desc\"  title=\"n. [数] 分母；命名者；共同特征或共同性质；平均水平或标准\">n. 
[数] 分母；命名者；共同特征或共同性质；平均水平或标准</div>\\n                        </td>\\n                        <!-- <td width=\"50px\">\\n                            <span class=\"flag\" style=\"display:none;\">0</span>\\n                            <span class=\"level\">\\n                                                        </span>\\n                        </td> -->\\n\\n                        <td width=\"85px\">2020-1-13</td>\\n                        <td >\\n                            <div  class=\"tags\" title=\"\"></div>\\n                        </td>\\n                        <td width=\"65px\" style=\"vertical-align:middle;\">\\n                            <a href=\"#\" class=\"editword\"  title=\"编辑denominator\" ></a>\\n                            \\n                           \\n                            <a href=\\n                                                        \"wordlist?action=delete&word=denominator&p=0\" \\n                                                        class=\"deleteword\" title=\"删除denominator\" onclick=\\'if(!confirm(\"您确定删除单词 denominator 吗？\")){ return false;}else return true;\\'></a>\\n                        </td>\\n                    </tr>\\n                                    </tbody>\\n            </table>\\n        </div>\\n      \\n  \\n        <div id=\"wordfoot\" >\\n            \\n                                <div id=\"pagination\">\\n                                                             <span class=\"current-page\">1 </span>\\n                    \\n                    \\n                                                                                                                                                                                                                                                    <a href=\"wordlist?p=1&tags=\">2</a> \\n                                                                                                                                                                          
                                                                                                                                                                                                                  <a href=\"wordlist?p=2&tags=\">3</a> \\n                                                                                                                                                                                                                                                                                                        <span style=\"border:none;\">...</span>\\n                                \\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    \\n                                        <a href=\"wordlist?p=1&tags=\" class=\"next-page\">下一页</a>\\n                                        <a href=\"wordlist?p=7&tags=\" class=\"next-page\">最后一页</a>\\n                </div>\\n               <form id=\"pagejumpform\" action=\"#\">\\n               跳至第<input type=\"text\" value=\"\"/>页<button type=\"submit\">确定</button>\\n               </form>                \\n                              \\n\\n               \\n             \\n             \\n             <div class=\"right\" >当前分类：<strong> 全部分类 </strong> &nbsp;&nbsp;共计 <strong>86</strong> 个单词 </div>\\n             <div class=\"clear\"></div>\\n        
</div>\\n            </div>\\n    \\n\\n    \\n    <div id=\"cardmode\">\\n          <div id=\"cardmode-wrap\">\\n        <div id=\"card\">\\n                                            <h1 ><span id=\"card_word\">agglomerative</span><a href=\"#\" id=\"phonetic-voice\"></a></h1> \\n                <div id=\"card_pronounce\">\\n                    \\n                </div>\\n\\n                <div id=\"description\" style=\"display:none;\">\\n                    adj. 会凝聚的；[冶] 烧结的，凝结的\\n                </div>\\n\\n                <div id=\"mask\" >\\n                    <span id=\"toggle-description\" ><img src=\"http://shared.ydstatic.com/dict/wordbook-v1/images/mask.png\"></span>\\n                </div>\\n            \\n                <div id=\"action\">\\n                    <a id=\"pre\" href=\"#\"></a>\\n                    <a id=\"next\" href=\"#\"></a>\\n                    <div style=\"clear:both;\"></div>\\n                </div>\\n                \\n                                                                                                                                                                                                                                                                                                                                                                                                                                            </div>\\n      </div>\\n        <div style=\"line-height:28px;text-align:right;\">\\n            当前分类：<strong> 全部分类 </strong> &nbsp;&nbsp;共计 \\n            <strong id=\"card_max_id\">86</strong> 个单词 现在是第<span id=\"card_id\"> 1</span>个\\n        </div>\\n              \\n    </div>\\n    \\n\\n\\n\\n\\n\\n<div id=\"footarea\" >\\n    <div style=\" line-height:2; margin:10px 0 20px;\">更好的进行生词的整理/记忆，请使用桌面版和手机版有道词典中的单词本</div>\\n    <div id=\"foot-ad\">\\n    \\n        <a href=\"http://cidian.youdao.com/?keyform=webwordbook\" class=\"go-to-desktop\" target=\"_blank\"></a>\\n        <a 
href=\"http://cidian.youdao.com/android.html?keyform=webwordbook\" class=\"go-to-mobile\" target=\"_blank\"></a>\\n\\n    </div>\\n</div>   \\n\\n</div>\\n\\n<div id=\"bottom\">\\n  <p><a href=\"http://youdao.com/\">有道首页</a> - <a href=\"http://www.youdao.com/help/dict/description/001/\">帮助</a> - <a href=\"http://www.youdao.com/about/\">关于有道</a> - <a href=\"http://i.youdao.com/\">官方博客</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&copy; 2020 网易公司 京ICP证080268号</p>\\n  \\n</div>\\n\\n\\n\\n    <div id=\"editwordform\">\\n        <h1>danci</h1>\\n        <a href=\"#\" id=\"close-editwordform\"></a>\\n        <form method=\"post\" action=\"wordlist?action=modify\">\\n        \\n        <label for=\"word\">单词<span id=\"waittext\"></span></label>\\n        <input id=\"word\" type=\"text\" value=\"\" name=\"word\" autocomplete=\"off\" />\\n        <label for=\"phonetic\">音标</label>\\n        <input id=\"phonetic\" type=\"text\" value=\"\" name=\"phonetic\" />\\n        <label for=\"desc\">解释</label>\\n        <textarea id=\"desc\" name=\"desc\" ></textarea>\\n        \\n        <label style=\"color:blue;\">更多（可不填）</label>\\n\\n        <label for=\"tags\">分类</label><input id=\"tags\" type=\"text\" value=\"\" name=\"tags\" autocomplete=\"off\" />\\n        <ul id=\"tag-select-list\">\\n                                            <li>无标签</li>\\n                                    </ul>\\n            \\n        <div class=\"center-content\"><button type=\"submit\"></button></div>\\n        </form>\\n    </div>        \\n\\n<div id=\"leftbar\">\\n<a href=\"/?keyfrom=webwordbook\">返回词典首页</a>\\n<br/><br/>\\n<a href=\"http://xue.youdao.com/\">返回有道学堂</a>\\n</div>    \\n    <object width=\"1\" height=\"1\" type=\"application/x-shockwave-flash\" id=\"dictVoice\" data=\"/dictVoice.swf\">\\n        <param name=\"movie\" value=\"/dictVoice.swf\"/>\\n        <param name=\"menu\" value=\"false\"/>\\n        <param name=\"allowScriptAccess\" value=\"always\"/>\\n        <param name=\"wmode\" 
value=\"transparent\"/>\\n    </object>\\n    \\n<script type=\"text/javascript\" src=\"http://shared.ydstatic.com/dict/wordbook-v1/scripts/jquery-1.5.2.min.js\"></script>\\n<script type=\"text/javascript\" src=\"http://shared.ydstatic.com/dict/wordbook-v1/scripts/jquery.extention.dict4.js\"></script>\\n<script type=\"text/javascript\" src=\"http://shared.ydstatic.com/dict/wordbook-v1/scripts/navigatorBar.js\"></script>\\n<script type=\"text/javascript\" src=\"resources/scripts/main.js\"></script>\\n</body>\\n</html>\\n'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "htmls[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Parse the HTML tables with Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read_html returns a list of DataFrames, one per <table> found in the page\n",
    "df = pd.read_html(htmls[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n",
      "<class 'list'>\n"
     ]
    }
   ],
   "source": [
    "print(len(df))\n",
    "print(type(df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>序号</th>\n",
       "      <th>单词</th>\n",
       "      <th>音标</th>\n",
       "      <th>解释</th>\n",
       "      <th>时间</th>\n",
       "      <th>分类</th>\n",
       "      <th>操作</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [序号, 单词, 音标, 解释, 时间, 分类, 操作]\n",
       "Index: []"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[0].head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>agglomerative</td>\n",
       "      <td>NaN</td>\n",
       "      <td>adj. 会凝聚的；[冶] 烧结的，凝结的</td>\n",
       "      <td>2020-1-13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>anatomy</td>\n",
       "      <td>[ə'nætəmɪ]</td>\n",
       "      <td>n. 解剖；解剖学；剖析；骨骼</td>\n",
       "      <td>2017-7-17</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>backbone</td>\n",
       "      <td>['bækbəʊn]</td>\n",
       "      <td>n. 支柱;主干网;决心,毅力;脊椎</td>\n",
       "      <td>2017-7-13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0              1           2                      3          4   5   6\n",
       "0  1  agglomerative         NaN  adj. 会凝聚的；[冶] 烧结的，凝结的  2020-1-13 NaN NaN\n",
       "1  2        anatomy  [ə'nætəmɪ]        n. 解剖；解剖学；剖析；骨骼  2017-7-17 NaN NaN\n",
       "2  3       backbone  ['bækbəʊn]     n. 支柱;主干网;决心,毅力;脊椎  2017-7-13 NaN NaN"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[1].head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_cont = df[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_cont.columns = df[0].columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>序号</th>\n",
       "      <th>单词</th>\n",
       "      <th>音标</th>\n",
       "      <th>解释</th>\n",
       "      <th>时间</th>\n",
       "      <th>分类</th>\n",
       "      <th>操作</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>agglomerative</td>\n",
       "      <td>NaN</td>\n",
       "      <td>adj. 会凝聚的；[冶] 烧结的，凝结的</td>\n",
       "      <td>2020-1-13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>anatomy</td>\n",
       "      <td>[ə'nætəmɪ]</td>\n",
       "      <td>n. 解剖；解剖学；剖析；骨骼</td>\n",
       "      <td>2017-7-17</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>backbone</td>\n",
       "      <td>['bækbəʊn]</td>\n",
       "      <td>n. 支柱;主干网;决心,毅力;脊椎</td>\n",
       "      <td>2017-7-13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   序号             单词          音标                     解释         时间  分类  操作\n",
       "0   1  agglomerative         NaN  adj. 会凝聚的；[冶] 烧结的，凝结的  2020-1-13 NaN NaN\n",
       "1   2        anatomy  [ə'nætəmɪ]        n. 解剖；解剖学；剖析；骨骼  2017-7-17 NaN NaN\n",
       "2   3       backbone  ['bækbəʊn]     n. 支柱;主干网;决心,毅力;脊椎  2017-7-13 NaN NaN"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_cont.head(3)"
   ]
  },
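  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The page keeps its column titles and its data rows in two separate `<table>` elements, which is why `pd.read_html` returns a list of two DataFrames in the cells above. The following is a minimal sketch of that situation on a made-up two-table snippet; `sample_html` and its contents are illustrative, not the real page. It assumes `lxml` is installed, and it wraps the string in `io.StringIO` because newer pandas releases deprecate passing a literal HTML string.\n",
    "\n",
    "```python\n",
    "import io\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Illustrative HTML: one header-only table and one data-only table.\n",
    "sample_html = '''\n",
    "<table>\n",
    "  <thead><tr><th>序号</th><th>单词</th></tr></thead>\n",
    "</table>\n",
    "<table>\n",
    "  <tbody>\n",
    "    <tr><td>1</td><td>anatomy</td></tr>\n",
    "    <tr><td>2</td><td>backbone</td></tr>\n",
    "  </tbody>\n",
    "</table>\n",
    "'''\n",
    "\n",
    "# read_html returns one DataFrame per <table> element it finds.\n",
    "tables = pd.read_html(io.StringIO(sample_html))\n",
    "header_df, body_df = tables\n",
    "\n",
    "# Reuse the header table's column names for the data table.\n",
    "body_df.columns = header_df.columns\n",
    "print(body_df)\n",
    "```\n",
    "\n",
    "Assigning `header_df.columns` works only because both tables have the same number of columns; a length mismatch raises a `ValueError`."
   ]
  },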
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Collect the tables from all 6 pages\n",
    "df_list = []\n",
    "for html in htmls:\n",
    "    df = pd.read_html(html)\n",
    "    df_cont = df[1]\n",
    "    df_cont.columns = df[0].columns\n",
    "    df_list.append(df_cont)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge the per-page tables into one DataFrame with a fresh 0..n-1 index\n",
    "df_all = pd.concat(df_list, ignore_index=True)"
   ]
  },
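  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each per-page table carries its own 0-based row labels, so the index of the merged result depends on how `pd.concat` is called. The sketch below illustrates the difference on two made-up frames; `page0` and `page1` are hypothetical stand-ins for the entries of `df_list`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Two illustrative per-page frames (stand-ins for the entries of df_list).\n",
    "page0 = pd.DataFrame({'单词': ['anatomy', 'backbone']})\n",
    "page1 = pd.DataFrame({'单词': ['criteria', 'crunch']})\n",
    "\n",
    "# Default: each page keeps its own row labels, so the labels repeat.\n",
    "kept = pd.concat([page0, page1])\n",
    "print(kept.index.tolist())        # [0, 1, 0, 1]\n",
    "\n",
    "# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.\n",
    "renumbered = pd.concat([page0, page1], ignore_index=True)\n",
    "print(renumbered.index.tolist())  # [0, 1, 2, 3]\n",
    "```\n",
    "\n",
    "Repeated labels are harmless when the export uses `index=False`, but they make label-based lookups such as `.loc[0]` return several rows."
   ]
  },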
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>序号</th>\n",
       "      <th>单词</th>\n",
       "      <th>音标</th>\n",
       "      <th>解释</th>\n",
       "      <th>时间</th>\n",
       "      <th>分类</th>\n",
       "      <th>操作</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>agglomerative</td>\n",
       "      <td>NaN</td>\n",
       "      <td>adj. 会凝聚的；[冶] 烧结的，凝结的</td>\n",
       "      <td>2020-1-13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>anatomy</td>\n",
       "      <td>[ə'nætəmɪ]</td>\n",
       "      <td>n. 解剖；解剖学；剖析；骨骼</td>\n",
       "      <td>2017-7-17</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>backbone</td>\n",
       "      <td>['bækbəʊn]</td>\n",
       "      <td>n. 支柱;主干网;决心,毅力;脊椎</td>\n",
       "      <td>2017-7-13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   序号             单词          音标                     解释         时间  分类  操作\n",
       "0   1  agglomerative         NaN  adj. 会凝聚的；[冶] 烧结的，凝结的  2020-1-13 NaN NaN\n",
       "1   2        anatomy  [ə'nætəmɪ]        n. 解剖；解剖学；剖析；骨骼  2017-7-17 NaN NaN\n",
       "2   3       backbone  ['bækbəʊn]     n. 支柱;主干网;决心,毅力;脊椎  2017-7-13 NaN NaN"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(86, 7)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_all.shape"
   ]
  },
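  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The 86 rows match the total of 86 words that the page footer reports. Another check that may be worth running before the export is whether any word appears more than once, for instance if the same page was downloaded twice. A minimal sketch on made-up data, where `words` is a hypothetical stand-in for `df_all`:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Illustrative stand-in for the merged word list.\n",
    "words = pd.DataFrame({'单词': ['anatomy', 'backbone', 'anatomy']})\n",
    "\n",
    "# Select every row whose word already appeared earlier in the frame.\n",
    "dupes = words[words['单词'].duplicated()]\n",
    "print(len(dupes))          # 1\n",
    "\n",
    "# Keep only the first occurrence of each word.\n",
    "unique_words = words.drop_duplicates(subset='单词')\n",
    "print(unique_words.shape)  # (2, 1)\n",
    "```"
   ]
  },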
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Write the results to an Excel file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export only the word, phonetic, and definition columns; drop the row index\n",
    "df_all[[\"单词\", \"音标\", \"解释\"]].to_excel(\"./course_datas/c32_read_html/网易有道单词本列表.xlsx\", index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
